Unsupervised morphological parsing of Bengali

Authors

  • Sajib Dasgupta
  • Vincent Ng
Abstract

Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and stems without prior knowledge of language-specific morphotactics and morpho-phonological rules. This paper introduces a simple yet highly effective algorithm for unsupervised morphological learning for Bengali, an Indo-Aryan language that is highly inflectional in nature. When evaluated on a set of 4110 human-segmented Bengali words, our algorithm achieves an F-score of 83%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological parsers, by about 23%.

Author affiliation: Sajib Dasgupta and Vincent Ng ({sajib,vince}@hlt.utdallas.edu), Human Language Technology Research Institute, University of Texas at Dallas, Richardson, TX 75083, USA.
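The abstract reports an F-score against human-segmented words but does not state the scoring protocol. As a rough illustration only, the sketch below computes a boundary-level precision, recall and F-score, which is one common way to score segmentations. The function names, the hyphen-separated segmentation format and the English toy words are all assumptions made for this example; they are not taken from the paper.

```python
def boundaries(segmentation):
    """Return the set of internal boundary offsets of a hyphen-separated segmentation.

    Hypothetical format: "un-help-ful" has boundaries after 2 and 6 characters.
    """
    positions = set()
    offset = 0
    morphemes = segmentation.split("-")
    for morpheme in morphemes[:-1]:
        offset += len(morpheme)
        positions.add(offset)
    return positions


def boundary_f_score(gold, predicted):
    """Boundary-level precision, recall and F-score over a dict of words.

    gold and predicted map each word to its segmentation. A word missing
    from predicted is treated as left unsegmented (an assumption).
    """
    true_pos = false_pos = false_neg = 0
    for word, gold_seg in gold.items():
        gold_b = boundaries(gold_seg)
        pred_b = boundaries(predicted.get(word, word))
        true_pos += len(gold_b & pred_b)
        false_pos += len(pred_b - gold_b)
        false_neg += len(gold_b - pred_b)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy English data, used only to exercise the functions.
gold = {"unhelpful": "un-help-ful", "walked": "walk-ed", "cats": "cat-s"}
predicted = {"unhelpful": "un-helpful", "walked": "walk-ed", "cats": "cats"}
precision, recall, f_score = boundary_f_score(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```

In this toy run every predicted boundary is correct but half of the gold boundaries are missed, so precision is high and recall is low. An exact-match scheme, where a word counts only if its whole segmentation is right, would give different numbers.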

Similar resources

High-Performance, Language-Independent Morphological Segmentation

This paper introduces an unsupervised morphological segmentation algorithm that shows robust performance for four languages with different levels of morphological complexity. In particular, our algorithm outperforms Goldsmith’s Linguistica and Creutz and Lagus’s Morphessor for English and Bengali, and achieves performance that is comparable to the best results for all three PASCAL evaluation da...

Unity in Diversity: A Unified Parsing Strategy for Major Indian Languages

This paper presents our work on applying a non-linear neural network to parsing five resource-poor Indian languages belonging to two major language families, Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages, whereas Kannada, Telugu and Malayalam belong to the Dravidian family. While little work has been done previously on Bengali and Telugu linear transition-based parsing, w...

A Hybrid Model for Part-of-Speech Tagging and its Application to Bengali

This paper describes our work on Bengali part-of-speech (POS) tagging using a corpus-based approach. There are several approaches to part-of-speech tagging. This paper deals with a model that combines supervised and unsupervised learning using a Hidden Markov Model (HMM). We make use of a small tagged corpus and a large untagged corpus. We also make use of a morphological analyzer. ...

Example Based English-Bengali Machine Translation Using WordNet

In this paper we propose an architecture for English-Bengali Example Based Machine Translation (EBMT) using WordNet. The proposed EBMT system has five steps: 1) tagging; 2) parsing; 3) preparing the chunks of the sentence using sub-sentential EBMT; 4) matching the sentence rule using an efficient adapting scheme; 5) translating from the source language (English) to the target language (Bengali) in the chunk and ge...

Unsupervised Morphological Segmentation with Recursive Neural Network

Motivated by the work of Socher et al. (2010; 2011) on syntactic parsing of natural language sentences, where the input is a sequence of words, our goal is to learn similar hierarchical parse trees for words instead, treating each character as a unit. By recursively grouping characters together, we aim to achieve unsupervised learning of not only the shallow morphological segmentation, i.e. bre...

Journal title:
  • Language Resources and Evaluation

Volume 40

Publication date: 2006